{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Main.Optimizers"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using Knet\n",
    "include(\"optimizers.jl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "This example demonstrates the usage of stochastic gradient descent(sgd) based optimization methods. We train LeNet model on MNIST dataset similar to `lenet.jl`.\n",
       "\n",
       "You can run the demo using `julia optimizers.jl`.  Use `julia optimizers.jl --help` for a list of options. By default the [LeNet](http://yann.lecun.com/exdb/lenet) convolutional neural network model will be trained using sgd for 10 epochs. At the end of the training accuracy for the training and test sets for each epoch will be printed  and optimized parameters will be returned.\n"
      ],
      "text/plain": [
       "  This example demonstrates the usage of stochastic gradient descent(sgd) based optimization methods. We train LeNet model on MNIST dataset similar\n",
       "  to \u001b[36mlenet.jl\u001b[39m.\n",
       "\n",
       "  You can run the demo using \u001b[36mjulia optimizers.jl\u001b[39m. Use \u001b[36mjulia optimizers.jl --help\u001b[39m for a list of options. By default the LeNet\n",
       "  (http://yann.lecun.com/exdb/lenet) convolutional neural network model will be trained using sgd for 10 epochs. At the end of the training accuracy\n",
       "  for the training and test sets for each epoch will be printed and optimized parameters will be returned."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@doc Optimizers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: <PROGRAM> [--seed SEED] [--batchsize BATCHSIZE] [--lr LR]\n",
      "                 [--eps EPS] [--gamma GAMMA] [--rho RHO]\n",
      "                 [--beta1 BETA1] [--beta2 BETA2] [--epochs EPOCHS]\n",
      "                 [--iters ITERS] [--optim OPTIM] [--atype ATYPE]\n",
      "\n",
      "optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration\n",
      "of different sgd based optimization methods using LeNet.\n",
      "\n",
      "optional arguments:\n",
      "  --seed SEED           random number seed: use a nonnegative int for\n",
      "                        repeatable results (type: Int64, default: -1)\n",
      "  --batchsize BATCHSIZE\n",
      "                        minibatch size (type: Int64, default: 100)\n",
      "  --lr LR               learning rate (type: Float64, default: 0.1)\n",
      "  --eps EPS             epsilon parameter used in adam, adagrad,\n",
      "                        adadelta (type: Float64, default: 1.0e-6)\n",
      "  --gamma GAMMA         gamma parameter used in momentum and nesterov\n",
      "                        (type: Float64, default: 0.95)\n",
      "  --rho RHO             rho parameter used in adadelta and rmsprop\n",
      "                        (type: Float64, default: 0.9)\n",
      "  --beta1 BETA1         beta1 parameter used in adam (type: Float64,\n",
      "                        default: 0.9)\n",
      "  --beta2 BETA2         beta2 parameter used in adam (type: Float64,\n",
      "                        default: 0.95)\n",
      "  --epochs EPOCHS       number of epochs for training (type: Int64,\n",
      "                        default: 10)\n",
      "  --iters ITERS         number of updates for training (type: Int64,\n",
      "                        default: 6000)\n",
      "  --optim OPTIM         optimization method (SGD, Momentum, Nesterov,\n",
      "                        Adagrad, Adadelta, Rmsprop, Adam) (default:\n",
      "                        \"SGD\")\n",
      "  --atype ATYPE         array and float type to use (default:\n",
      "                        \"KnetArray{Float32}\")\n",
      "\n"
     ]
    }
   ],
   "source": [
    "Optimizers.main(\"--help\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```\n",
       "SGD(;lr=0.001,gclip=0)\n",
       "update!(w,g,p::SGD)\n",
       "update!(w,g;lr=0.001)\n",
       "```\n",
       "\n",
       "Container for parameters of the Stochastic gradient descent (SGD) optimization algorithm used by [`update!`](@ref).\n",
       "\n",
       "SGD is an optimization technique to minimize an objective function by updating its weights in the opposite direction of their gradient. The learning rate (lr) determines the size of the step.  SGD updates the weights with the following formula:\n",
       "\n",
       "```\n",
       "w = w - lr * g\n",
       "```\n",
       "\n",
       "where `w` is a weight array, `g` is the gradient of the loss function w.r.t `w` and `lr` is the learning rate.\n",
       "\n",
       "If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.\n",
       "\n",
       "SGD is used by default if no algorithm is specified in the two argument version of `update!`[@ref].\n"
      ],
      "text/plain": [
       "\u001b[36m  SGD(;lr=0.001,gclip=0)\u001b[39m\n",
       "\u001b[36m  update!(w,g,p::SGD)\u001b[39m\n",
       "\u001b[36m  update!(w,g;lr=0.001)\u001b[39m\n",
       "\n",
       "  Container for parameters of the Stochastic gradient descent (SGD) optimization algorithm used by \u001b[36mupdate!\u001b[39m.\n",
       "\n",
       "  SGD is an optimization technique to minimize an objective function by updating its weights in the opposite direction of their gradient. The\n",
       "  learning rate (lr) determines the size of the step. SGD updates the weights with the following formula:\n",
       "\n",
       "\u001b[36m  w = w - lr * g\u001b[39m\n",
       "\n",
       "  where \u001b[36mw\u001b[39m is a weight array, \u001b[36mg\u001b[39m is the gradient of the loss function w.r.t \u001b[36mw\u001b[39m and \u001b[36mlr\u001b[39m is the learning rate.\n",
       "\n",
       "  If \u001b[36mnorm(g) > gclip > 0\u001b[39m, \u001b[36mg\u001b[39m is scaled so that its norm is equal to \u001b[36mgclip\u001b[39m. If \u001b[36mgclip==0\u001b[39m no scaling takes place.\n",
       "\n",
       "  SGD is used by default if no algorithm is specified in the two argument version of \u001b[36mupdate!\u001b[39m[@ref]."
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@doc SGD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.\n",
      "opts=(:atype, \"KnetArray{Float32}\")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, \"SGD\")(:lr, 0.1)(:seed, -1)\n",
      "(:epoch, 0, :trn, 0.1017, :tst, 0.1019)\n",
      "(:epoch, 1, :trn, 0.96335, :tst, 0.9657)\n",
      "(:epoch, 2, :trn, 0.97825, :tst, 0.9778)\n",
      "(:epoch, 3, :trn, 0.9844, :tst, 0.9821)\n",
      "(:epoch, 4, :trn, 0.988, :tst, 0.9853)\n",
      "(:epoch, 5, :trn, 0.9904166666666666, :tst, 0.9862)\n",
      "(:epoch, 6, :trn, 0.9917666666666667, :tst, 0.9873)\n",
      "(:epoch, 7, :trn, 0.9926333333333334, :tst, 0.9873)\n",
      "(:epoch, 8, :trn, 0.9932666666666666, :tst, 0.9879)\n",
      "(:epoch, 9, :trn, 0.9938, :tst, 0.9884)\n",
      "(:epoch, 10, :trn, 0.9946, :tst, 0.9883)\n",
      " 30.585886 seconds (11.26 M allocations: 4.346 GiB, 1.08% gc time)\n"
     ]
    }
   ],
   "source": [
    "Optimizers.main(\"\"); # Tries SGD by default"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```\n",
       "Momentum(;lr=0.001, gclip=0, gamma=0.9)\n",
       "update!(w,g,p::Momentum)\n",
       "```\n",
       "\n",
       "Container for parameters of the Momentum optimization algorithm used by [`update!`](@ref).\n",
       "\n",
       "The Momentum method tries to accelerate SGD by adding a velocity term to the update.  This also decreases the oscillation between successive steps. It updates the weights with the following formulas:\n",
       "\n",
       "```\n",
       "velocity = gamma * velocity + lr * g\n",
       "w = w - velocity\n",
       "```\n",
       "\n",
       "where `w` is a weight array, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `gamma` is the momentum parameter, `velocity` is an array with the same size and type of `w` and holds the accelerated gradients.\n",
       "\n",
       "If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.\n",
       "\n",
       "Reference: [Qian, N. (1999)](http://doi.org/10.1016/S0893-6080(98)00116-6). On the momentum term in gradient descent learning algorithms.  Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151.\n"
      ],
      "text/plain": [
       "\u001b[36m  Momentum(;lr=0.001, gclip=0, gamma=0.9)\u001b[39m\n",
       "\u001b[36m  update!(w,g,p::Momentum)\u001b[39m\n",
       "\n",
       "  Container for parameters of the Momentum optimization algorithm used by \u001b[36mupdate!\u001b[39m.\n",
       "\n",
       "  The Momentum method tries to accelerate SGD by adding a velocity term to the update. This also decreases the oscillation between successive steps.\n",
       "  It updates the weights with the following formulas:\n",
       "\n",
       "\u001b[36m  velocity = gamma * velocity + lr * g\u001b[39m\n",
       "\u001b[36m  w = w - velocity\u001b[39m\n",
       "\n",
       "  where \u001b[36mw\u001b[39m is a weight array, \u001b[36mg\u001b[39m is the gradient of the objective function w.r.t \u001b[36mw\u001b[39m, \u001b[36mlr\u001b[39m is the learning rate, \u001b[36mgamma\u001b[39m is the momentum parameter, \u001b[36mvelocity\u001b[39m\n",
       "  is an array with the same size and type of \u001b[36mw\u001b[39m and holds the accelerated gradients.\n",
       "\n",
       "  If \u001b[36mnorm(g) > gclip > 0\u001b[39m, \u001b[36mg\u001b[39m is scaled so that its norm is equal to \u001b[36mgclip\u001b[39m. If \u001b[36mgclip==0\u001b[39m no scaling takes place.\n",
       "\n",
       "  Reference: Qian, N. (1999) (http://doi.org/10.1016/S0893-6080(98)00116-6). On the momentum term in gradient descent learning algorithms. Neural\n",
       "  Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151."
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@doc Momentum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.\n",
      "opts=(:atype, \"KnetArray{Float32}\")(:gamma, 0.99)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, \"Momentum\")(:lr, 0.005)(:seed, -1)\n",
      "(:epoch, 0, :trn, 0.10626666666666666, :tst, 0.1099)\n",
      "(:epoch, 1, :trn, 0.9788666666666667, :tst, 0.9787)\n",
      "(:epoch, 2, :trn, 0.9856, :tst, 0.9845)\n",
      "(:epoch, 3, :trn, 0.9864166666666667, :tst, 0.9833)\n",
      "(:epoch, 4, :trn, 0.9870833333333333, :tst, 0.9842)\n",
      "(:epoch, 5, :trn, 0.9899166666666667, :tst, 0.9834)\n",
      "(:epoch, 6, :trn, 0.9940333333333333, :tst, 0.9869)\n",
      "(:epoch, 7, :trn, 0.99255, :tst, 0.986)\n",
      "(:epoch, 8, :trn, 0.9959333333333333, :tst, 0.9879)\n",
      "(:epoch, 9, :trn, 0.9962166666666666, :tst, 0.9881)\n",
      "(:epoch, 10, :trn, 0.9966666666666667, :tst, 0.9906)\n",
      " 31.861608 seconds (11.83 M allocations: 4.375 GiB, 1.22% gc time)\n"
     ]
    }
   ],
   "source": [
    "Optimizers.main(\"--optim Momentum --lr 0.005 --gamma 0.99\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```\n",
       "Nesterov(; lr=0.001, gclip=0, gamma=0.9)\n",
       "update!(w,g,p::Momentum)\n",
       "```\n",
       "\n",
       "Container for parameters of Nesterov's momentum optimization algorithm used by [`update!`](@ref).\n",
       "\n",
       "It is similar to standard [`Momentum`](@ref) but with a slightly different update rule:\n",
       "\n",
       "```\n",
       "velocity = gamma * velocity_old - lr * g\n",
       "w = w_old - velocity_old + (1+gamma) * velocity\n",
       "```\n",
       "\n",
       "where `w` is a weight array, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `gamma` is the momentum parameter, `velocity` is an array with the same size and type of `w` and holds the accelerated gradients.\n",
       "\n",
       "If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip == 0` no scaling takes place.\n",
       "\n",
       "Reference Implementation : [Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan P ascanu](https://arxiv.org/pdf/1212.0901.pdf)\n"
      ],
      "text/plain": [
       "\u001b[36m  Nesterov(; lr=0.001, gclip=0, gamma=0.9)\u001b[39m\n",
       "\u001b[36m  update!(w,g,p::Momentum)\u001b[39m\n",
       "\n",
       "  Container for parameters of Nesterov's momentum optimization algorithm used by \u001b[36mupdate!\u001b[39m.\n",
       "\n",
       "  It is similar to standard \u001b[36mMomentum\u001b[39m but with a slightly different update rule:\n",
       "\n",
       "\u001b[36m  velocity = gamma * velocity_old - lr * g\u001b[39m\n",
       "\u001b[36m  w = w_old - velocity_old + (1+gamma) * velocity\u001b[39m\n",
       "\n",
       "  where \u001b[36mw\u001b[39m is a weight array, \u001b[36mg\u001b[39m is the gradient of the objective function w.r.t \u001b[36mw\u001b[39m, \u001b[36mlr\u001b[39m is the learning rate, \u001b[36mgamma\u001b[39m is the momentum parameter, \u001b[36mvelocity\u001b[39m\n",
       "  is an array with the same size and type of \u001b[36mw\u001b[39m and holds the accelerated gradients.\n",
       "\n",
       "  If \u001b[36mnorm(g) > gclip > 0\u001b[39m, \u001b[36mg\u001b[39m is scaled so that its norm is equal to \u001b[36mgclip\u001b[39m. If \u001b[36mgclip == 0\u001b[39m no scaling takes place.\n",
       "\n",
       "  Reference Implementation : Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan P ascanu (https://arxiv.org/pdf/1212.0901.pdf)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@doc Nesterov"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.\n",
      "opts=(:atype, \"KnetArray{Float32}\")(:gamma, 0.99)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, \"Nesterov\")(:lr, 0.005)(:seed, -1)\n",
      "(:epoch, 0, :trn, 0.06255, :tst, 0.0574)\n",
      "(:epoch, 1, :trn, 0.9762833333333333, :tst, 0.9777)\n",
      "(:epoch, 2, :trn, 0.9874666666666667, :tst, 0.9855)\n",
      "(:epoch, 3, :trn, 0.9898166666666667, :tst, 0.9851)\n",
      "(:epoch, 4, :trn, 0.99075, :tst, 0.9868)\n",
      "(:epoch, 5, :trn, 0.9943166666666666, :tst, 0.9893)\n",
      "(:epoch, 6, :trn, 0.9946666666666667, :tst, 0.9901)\n",
      "(:epoch, 7, :trn, 0.9958333333333333, :tst, 0.9892)\n",
      "(:epoch, 8, :trn, 0.9976833333333334, :tst, 0.9909)\n",
      "(:epoch, 9, :trn, 0.9967666666666667, :tst, 0.9894)\n",
      "(:epoch, 10, :trn, 0.9969666666666667, :tst, 0.99)\n",
      " 32.278787 seconds (11.84 M allocations: 4.371 GiB, 1.26% gc time)\n"
     ]
    }
   ],
   "source": [
    "Optimizers.main(\"--optim Nesterov --lr 0.005 --gamma 0.99\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```\n",
       "Adagrad(;lr=0.1, gclip=0, eps=1e-6)\n",
       "update!(w,g,p::Adagrad)\n",
       "```\n",
       "\n",
       "Container for parameters of the Adagrad optimization algorithm used by [`update!`](@ref).\n",
       "\n",
       "Adagrad is one of the methods that adapts the learning rate to each of the weights.  It stores the sum of the squares of the gradients to scale the learning rate.  The learning rate is adapted for each weight by the value of current gradient divided by the accumulated gradients. Hence, the learning rate is greater for the parameters where the accumulated gradients are small and the learning rate is small if the accumulated gradients are large. It updates the weights with the following formulas:\n",
       "\n",
       "```\n",
       "G = G + g .^ 2\n",
       "w = w - g .* lr ./ sqrt(G + eps)\n",
       "```\n",
       "\n",
       "where `w` is the weight, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `G` is an array with the same size and type of `w` and holds the sum of the squares of the gradients. `eps` is a small constant to prevent a zero value in the denominator.\n",
       "\n",
       "If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.\n",
       "\n",
       "Reference: [Duchi, J., Hazan, E., & Singer, Y. (2011)](http://jmlr.org/papers/v12/duchi11a.html). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.\n"
      ],
      "text/plain": [
       "\u001b[36m  Adagrad(;lr=0.1, gclip=0, eps=1e-6)\u001b[39m\n",
       "\u001b[36m  update!(w,g,p::Adagrad)\u001b[39m\n",
       "\n",
       "  Container for parameters of the Adagrad optimization algorithm used by \u001b[36mupdate!\u001b[39m.\n",
       "\n",
       "  Adagrad is one of the methods that adapts the learning rate to each of the weights. It stores the sum of the squares of the gradients to scale the\n",
       "  learning rate. The learning rate is adapted for each weight by the value of current gradient divided by the accumulated gradients. Hence, the\n",
       "  learning rate is greater for the parameters where the accumulated gradients are small and the learning rate is small if the accumulated gradients\n",
       "  are large. It updates the weights with the following formulas:\n",
       "\n",
       "\u001b[36m  G = G + g .^ 2\u001b[39m\n",
       "\u001b[36m  w = w - g .* lr ./ sqrt(G + eps)\u001b[39m\n",
       "\n",
       "  where \u001b[36mw\u001b[39m is the weight, \u001b[36mg\u001b[39m is the gradient of the objective function w.r.t \u001b[36mw\u001b[39m, \u001b[36mlr\u001b[39m is the learning rate, \u001b[36mG\u001b[39m is an array with the same size and type of \u001b[36mw\u001b[39m\n",
       "  and holds the sum of the squares of the gradients. \u001b[36meps\u001b[39m is a small constant to prevent a zero value in the denominator.\n",
       "\n",
       "  If \u001b[36mnorm(g) > gclip > 0\u001b[39m, \u001b[36mg\u001b[39m is scaled so that its norm is equal to \u001b[36mgclip\u001b[39m. If \u001b[36mgclip==0\u001b[39m no scaling takes place.\n",
       "\n",
       "  Reference: Duchi, J., Hazan, E., & Singer, Y. (2011) (http://jmlr.org/papers/v12/duchi11a.html). Adaptive Subgradient Methods for Online Learning\n",
       "  and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159."
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@doc Adagrad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.\n",
      "opts=(:atype, \"KnetArray{Float32}\")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, \"Adagrad\")(:lr, 0.01)(:seed, -1)\n",
      "(:epoch, 0, :trn, 0.10341666666666667, :tst, 0.1003)\n",
      "(:epoch, 1, :trn, 0.9788166666666667, :tst, 0.9795)\n",
      "(:epoch, 2, :trn, 0.9867666666666667, :tst, 0.9853)\n",
      "(:epoch, 3, :trn, 0.9912666666666666, :tst, 0.9891)\n",
      "(:epoch, 4, :trn, 0.9936, :tst, 0.99)\n",
      "(:epoch, 5, :trn, 0.995, :tst, 0.9906)\n",
      "(:epoch, 6, :trn, 0.9957833333333334, :tst, 0.9909)\n",
      "(:epoch, 7, :trn, 0.9966333333333334, :tst, 0.9911)\n",
      "(:epoch, 8, :trn, 0.997, :tst, 0.9913)\n",
      "(:epoch, 9, :trn, 0.9974333333333333, :tst, 0.9912)\n",
      "(:epoch, 10, :trn, 0.9978, :tst, 0.9908)\n",
      " 33.680254 seconds (13.07 M allocations: 4.410 GiB, 1.20% gc time)\n"
     ]
    }
   ],
   "source": [
    "Optimizers.main(\"--optim Adagrad --lr 0.01 --eps 1e-6\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```\n",
       "Adadelta(;lr=0.01, gclip=0, rho=0.9, eps=1e-6)\n",
       "update!(w,g,p::Adadelta)\n",
       "```\n",
       "\n",
       "Container for parameters of the Adadelta optimization algorithm used by [`update!`](@ref).\n",
       "\n",
       "Adadelta is an extension of Adagrad that tries to prevent the decrease of the learning rates to zero as training progresses. It scales the learning rate based on the accumulated gradients like Adagrad and holds the acceleration term like Momentum. It updates the weights with the following formulas:\n",
       "\n",
       "```\n",
       "G = (1-rho) * g .^ 2 + rho * G\n",
       "update = g .* sqrt(delta + eps) ./ sqrt(G + eps)\n",
       "w = w - lr * update\n",
       "delta = rho * delta + (1-rho) * update .^ 2\n",
       "```\n",
       "\n",
       "where `w` is the weight, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `G` is an array with the same size and type of `w` and holds the sum of the squares of the gradients. `eps` is a small constant to prevent a zero value in the denominator.  `rho` is the momentum parameter and `delta` is an array with the same size and type of `w` and holds the sum of the squared updates.\n",
       "\n",
       "If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.\n",
       "\n",
       "Reference: [Zeiler, M. D. (2012)](http://arxiv.org/abs/1212.5701). ADADELTA: An Adaptive Learning Rate Method.\n"
      ],
      "text/plain": [
       "\u001b[36m  Adadelta(;lr=0.01, gclip=0, rho=0.9, eps=1e-6)\u001b[39m\n",
       "\u001b[36m  update!(w,g,p::Adadelta)\u001b[39m\n",
       "\n",
       "  Container for parameters of the Adadelta optimization algorithm used by \u001b[36mupdate!\u001b[39m.\n",
       "\n",
       "  Adadelta is an extension of Adagrad that tries to prevent the decrease of the learning rates to zero as training progresses. It scales the learning\n",
       "  rate based on the accumulated gradients like Adagrad and holds the acceleration term like Momentum. It updates the weights with the following\n",
       "  formulas:\n",
       "\n",
       "\u001b[36m  G = (1-rho) * g .^ 2 + rho * G\u001b[39m\n",
       "\u001b[36m  update = g .* sqrt(delta + eps) ./ sqrt(G + eps)\u001b[39m\n",
       "\u001b[36m  w = w - lr * update\u001b[39m\n",
       "\u001b[36m  delta = rho * delta + (1-rho) * update .^ 2\u001b[39m\n",
       "\n",
       "  where \u001b[36mw\u001b[39m is the weight, \u001b[36mg\u001b[39m is the gradient of the objective function w.r.t \u001b[36mw\u001b[39m, \u001b[36mlr\u001b[39m is the learning rate, \u001b[36mG\u001b[39m is an array with the same size and type of \u001b[36mw\u001b[39m\n",
       "  and holds the sum of the squares of the gradients. \u001b[36meps\u001b[39m is a small constant to prevent a zero value in the denominator. \u001b[36mrho\u001b[39m is the momentum\n",
       "  parameter and \u001b[36mdelta\u001b[39m is an array with the same size and type of \u001b[36mw\u001b[39m and holds the sum of the squared updates.\n",
       "\n",
       "  If \u001b[36mnorm(g) > gclip > 0\u001b[39m, \u001b[36mg\u001b[39m is scaled so that its norm is equal to \u001b[36mgclip\u001b[39m. If \u001b[36mgclip==0\u001b[39m no scaling takes place.\n",
       "\n",
       "  Reference: Zeiler, M. D. (2012) (http://arxiv.org/abs/1212.5701). ADADELTA: An Adaptive Learning Rate Method."
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@doc Adadelta"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.\n",
      "opts=(:atype, \"KnetArray{Float32}\")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, \"Adadelta\")(:lr, 0.35)(:seed, -1)\n",
      "(:epoch, 0, :trn, 0.07245, :tst, 0.0725)\n",
      "(:epoch, 1, :trn, 0.9737333333333333, :tst, 0.9754)\n",
      "(:epoch, 2, :trn, 0.9836166666666667, :tst, 0.9834)\n",
      "(:epoch, 3, :trn, 0.9884833333333334, :tst, 0.9859)\n",
      "(:epoch, 4, :trn, 0.9905166666666667, :tst, 0.9876)\n",
      "(:epoch, 5, :trn, 0.99355, :tst, 0.9898)\n",
      "(:epoch, 6, :trn, 0.9953333333333333, :tst, 0.9902)\n",
      "(:epoch, 7, :trn, 0.9960666666666667, :tst, 0.9919)\n",
      "(:epoch, 8, :trn, 0.9963833333333333, :tst, 0.991)\n",
      "(:epoch, 9, :trn, 0.9965333333333334, :tst, 0.9907)\n",
      "(:epoch, 10, :trn, 0.9973333333333333, :tst, 0.9914)\n",
      " 36.130899 seconds (13.88 M allocations: 4.422 GiB, 1.03% gc time)\n"
     ]
    }
   ],
   "source": [
    "Optimizers.main(\"--optim Adadelta --lr 0.35 --rho 0.9 --eps 1e-6\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```\n",
       "Rmsprop(;lr=0.001, gclip=0, rho=0.9, eps=1e-6)\n",
       "update!(w,g,p::Rmsprop)\n",
       "```\n",
       "\n",
       "Container for parameters of the Rmsprop optimization algorithm used by [`update!`](@ref).\n",
       "\n",
       "Rmsprop scales the learning rates by dividing the root mean squared of the gradients. It updates the weights with the following formula:\n",
       "\n",
       "```\n",
       "G = (1-rho) * g .^ 2 + rho * G\n",
       "w = w - lr * g ./ sqrt(G + eps)\n",
       "```\n",
       "\n",
       "where `w` is the weight, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `G` is an array with the same size and type of `w` and holds the sum of the squares of the gradients. `eps` is a small constant to prevent a zero value in the denominator.  `rho` is the momentum parameter and `delta` is an array with the same size and type of `w` and holds the sum of the squared updates.\n",
       "\n",
       "If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.\n",
       "\n",
       "Reference: [Tijmen Tieleman and Geoffrey Hinton (2012)](https://dirtysalt.github.io/images/nn-class-lec6.pdf). \"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.\"  COURSERA: Neural Networks for Machine Learning 4.2.\n"
      ],
      "text/plain": [
       "\u001b[36m  Rmsprop(;lr=0.001, gclip=0, rho=0.9, eps=1e-6)\u001b[39m\n",
       "\u001b[36m  update!(w,g,p::Rmsprop)\u001b[39m\n",
       "\n",
       "  Container for parameters of the Rmsprop optimization algorithm used by \u001b[36mupdate!\u001b[39m.\n",
       "\n",
       "  Rmsprop scales the learning rates by dividing the root mean squared of the gradients. It updates the weights with the following formula:\n",
       "\n",
       "\u001b[36m  G = (1-rho) * g .^ 2 + rho * G\u001b[39m\n",
       "\u001b[36m  w = w - lr * g ./ sqrt(G + eps)\u001b[39m\n",
       "\n",
       "  where \u001b[36mw\u001b[39m is the weight, \u001b[36mg\u001b[39m is the gradient of the objective function w.r.t \u001b[36mw\u001b[39m, \u001b[36mlr\u001b[39m is the learning rate, \u001b[36mG\u001b[39m is an array with the same size and type of \u001b[36mw\u001b[39m\n",
       "  and holds the sum of the squares of the gradients. \u001b[36meps\u001b[39m is a small constant to prevent a zero value in the denominator. \u001b[36mrho\u001b[39m is the momentum\n",
       "  parameter and \u001b[36mdelta\u001b[39m is an array with the same size and type of \u001b[36mw\u001b[39m and holds the sum of the squared updates.\n",
       "\n",
       "  If \u001b[36mnorm(g) > gclip > 0\u001b[39m, \u001b[36mg\u001b[39m is scaled so that its norm is equal to \u001b[36mgclip\u001b[39m. If \u001b[36mgclip==0\u001b[39m no scaling takes place.\n",
       "\n",
       "  Reference: Tijmen Tieleman and Geoffrey Hinton (2012) (https://dirtysalt.github.io/images/nn-class-lec6.pdf). \"Lecture 6.5-rmsprop: Divide the\n",
       "  gradient by a running average of its recent magnitude.\" COURSERA: Neural Networks for Machine Learning 4.2."
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@doc Rmsprop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.\n",
      "opts=(:atype, \"KnetArray{Float32}\")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, \"Rmsprop\")(:lr, 0.001)(:seed, -1)\n",
      "(:epoch, 0, :trn, 0.09513333333333333, :tst, 0.0929)\n",
      "(:epoch, 1, :trn, 0.9816666666666667, :tst, 0.9804)\n",
      "(:epoch, 2, :trn, 0.9891, :tst, 0.988)\n",
      "(:epoch, 3, :trn, 0.9922, :tst, 0.9892)\n",
      "(:epoch, 4, :trn, 0.9942666666666666, :tst, 0.9911)\n",
      "(:epoch, 5, :trn, 0.9955, :tst, 0.9914)\n",
      "(:epoch, 6, :trn, 0.9968833333333333, :tst, 0.9923)\n",
      "(:epoch, 7, :trn, 0.9974333333333333, :tst, 0.9923)\n",
      "(:epoch, 8, :trn, 0.9968666666666667, :tst, 0.9918)\n",
      "(:epoch, 9, :trn, 0.9977166666666667, :tst, 0.9919)\n",
      "(:epoch, 10, :trn, 0.99445, :tst, 0.9884)\n",
      " 33.505476 seconds (12.59 M allocations: 4.384 GiB, 1.26% gc time)\n"
     ]
    }
   ],
   "source": [
    "Optimizers.main(\"--optim Rmsprop --lr 0.001 --rho 0.9 --eps 1e-6\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```\n",
       "Adam(;lr=0.001, gclip=0, beta1=0.9, beta2=0.999, eps=1e-8)\n",
       "update!(w,g,p::Adam)\n",
       "```\n",
       "\n",
       "Container for parameters of the Adam optimization algorithm used by [`update!`](@ref).\n",
       "\n",
       "Adam is one of the methods that compute the adaptive learning rate. It stores accumulated gradients (first moment) and the sum of the squared of gradients (second).  It scales the first and second moment as a function of time. Here is the update formulas:\n",
       "\n",
       "```\n",
       "m = beta1 * m + (1 - beta1) * g\n",
       "v = beta2 * v + (1 - beta2) * g .* g\n",
       "mhat = m ./ (1 - beta1 ^ t)\n",
       "vhat = v ./ (1 - beta2 ^ t)\n",
       "w = w - (lr / (sqrt(vhat) + eps)) * mhat\n",
       "```\n",
       "\n",
       "where `w` is the weight, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `m` is an array with the same size and type of `w` and holds the accumulated gradients. `v` is an array with the same size and type of `w` and holds the sum of the squares of the gradients. `eps` is a small constant to prevent a zero denominator. `beta1` and `beta2` are the parameters to calculate bias corrected first and second moments. `t` is the update count.\n",
       "\n",
       "If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.\n",
       "\n",
       "Reference: [Kingma, D. P., & Ba, J. L. (2015)](https://arxiv.org/abs/1412.6980). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.\n"
      ],
      "text/plain": [
       "\u001b[36m  Adam(;lr=0.001, gclip=0, beta1=0.9, beta2=0.999, eps=1e-8)\u001b[39m\n",
       "\u001b[36m  update!(w,g,p::Adam)\u001b[39m\n",
       "\n",
       "  Container for parameters of the Adam optimization algorithm used by \u001b[36mupdate!\u001b[39m.\n",
       "\n",
       "  Adam is one of the methods that compute the adaptive learning rate. It stores accumulated gradients (first moment) and the sum of the squared of\n",
       "  gradients (second). It scales the first and second moment as a function of time. Here is the update formulas:\n",
       "\n",
       "\u001b[36m  m = beta1 * m + (1 - beta1) * g\u001b[39m\n",
       "\u001b[36m  v = beta2 * v + (1 - beta2) * g .* g\u001b[39m\n",
       "\u001b[36m  mhat = m ./ (1 - beta1 ^ t)\u001b[39m\n",
       "\u001b[36m  vhat = v ./ (1 - beta2 ^ t)\u001b[39m\n",
       "\u001b[36m  w = w - (lr / (sqrt(vhat) + eps)) * mhat\u001b[39m\n",
       "\n",
       "  where \u001b[36mw\u001b[39m is the weight, \u001b[36mg\u001b[39m is the gradient of the objective function w.r.t \u001b[36mw\u001b[39m, \u001b[36mlr\u001b[39m is the learning rate, \u001b[36mm\u001b[39m is an array with the same size and type of \u001b[36mw\u001b[39m\n",
       "  and holds the accumulated gradients. \u001b[36mv\u001b[39m is an array with the same size and type of \u001b[36mw\u001b[39m and holds the sum of the squares of the gradients. \u001b[36meps\u001b[39m is a\n",
       "  small constant to prevent a zero denominator. \u001b[36mbeta1\u001b[39m and \u001b[36mbeta2\u001b[39m are the parameters to calculate bias corrected first and second moments. \u001b[36mt\u001b[39m is the\n",
       "  update count.\n",
       "\n",
       "  If \u001b[36mnorm(g) > gclip > 0\u001b[39m, \u001b[36mg\u001b[39m is scaled so that its norm is equal to \u001b[36mgclip\u001b[39m. If \u001b[36mgclip==0\u001b[39m no scaling takes place.\n",
       "\n",
       "  Reference: Kingma, D. P., & Ba, J. L. (2015) (https://arxiv.org/abs/1412.6980). Adam: a Method for Stochastic Optimization. International\n",
       "  Conference on Learning Representations, 1–13."
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@doc Adam"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.\n",
      "opts=(:atype, \"KnetArray{Float32}\")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-8)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, \"Adam\")(:lr, 0.001)(:seed, -1)\n",
      "(:epoch, 0, :trn, 0.13488333333333333, :tst, 0.1404)\n",
      "(:epoch, 1, :trn, 0.9800166666666666, :tst, 0.9809)\n",
      "(:epoch, 2, :trn, 0.9897333333333334, :tst, 0.9875)\n",
      "(:epoch, 3, :trn, 0.99125, :tst, 0.9887)\n",
      "(:epoch, 4, :trn, 0.9933166666666666, :tst, 0.9894)\n",
      "(:epoch, 5, :trn, 0.99345, :tst, 0.9873)\n",
      "(:epoch, 6, :trn, 0.9950333333333333, :tst, 0.9876)\n",
      "(:epoch, 7, :trn, 0.9936833333333334, :tst, 0.9864)\n",
      "(:epoch, 8, :trn, 0.9949333333333333, :tst, 0.9865)\n",
      "(:epoch, 9, :trn, 0.9966666666666667, :tst, 0.9896)\n",
      "(:epoch, 10, :trn, 0.9970333333333333, :tst, 0.9889)\n",
      " 37.025818 seconds (13.83 M allocations: 4.422 GiB, 1.42% gc time)\n"
     ]
    }
   ],
   "source": [
    "Optimizers.main(\"--optim Adam --lr 0.001 --beta1 0.9 --beta2 0.95 --eps 1e-8\");"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.0",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
